Venice is drowning - final report
Abstract
The main objective of the project is to analyze tide level measurements from the Venice lagoon and to build predictive models whose performance is evaluated over forecast horizons ranging from one hour to one week.
For this purpose, three models are tested, two linear and one based on machine learning:
- ARIMA (AutoRegressive Integrated Moving Average);
- UCM (Unobserved Component Models);
- LSTM (Long Short-Term Memory).
Datasets
Two datasets are the basis for the project pipeline:
- the “main” dataset contains the tide level measurements (in cm) in the Venice lagoon relative to a reference level, recorded by a sensor between 1983 and 2018;
- a second dataset holds meteorological variables for the period between 2000 and 2019: rainfall (mm), wind direction at 10 meters (degrees) and wind speed at 10 meters (meters per second).
The tide level dataset is assembled from the individual historical datasets made public by the city of Venice, in particular by the Centro Previsioni e Segnalazioni Maree. The meteorological data were instead provided, on request, by ARPA Veneto.
All the preprocessing operations regarding parsing, inspection and the final union of the cited datasets are available in the following scripts:
- parsing_tides_data builds the tidal dataset by importing and unifying the individual annual datasets;
- inspection contains a series of preliminary inspections of the aforementioned data;
- preprocess_weather_data_2000_2019 contains the preprocessing operations of the weather-related dataset;
- parsing_tides_weather reports a summary of the procedure implemented in order to deal with missing data in the weather dataset, and contains the merging operation producing the final weather dataset.
As a deliberate choice, due to time and computational constraints, only the data from 2010 to 2018 are kept after the preprocessing.
Data inspection
During the preprocessing phase, some descriptive visualizations regarding the main time series are produced in order to inspect its characteristics.
Models
As anticipated, the models fall into two areas: a purely statistical one, with linear models such as ARIMA and UCM, and a machine learning one, with an LSTM model. The preparation and implementation of the models are presented below, followed by a results section that allows a quick comparison of their performance on a test set defined a priori. The training data differ between the two areas:
- for the linear models, the training set is composed of the last six months of 2018, from July to December;
- for the machine learning model, given its capacity to handle more data without a corresponding increase in computational time, the training set covers the period between January 2010 and December 2018.
The test set, extracted beforehand, refers to the last week of December 2018, i.e. from 24/12/2018 23:00:00 to 31/12/2018 23:00:00.
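The split described above can be sketched in pandas as follows. This is only an illustration, not the project code: the series is a zero-filled placeholder, the names `series`, `train_linear` and `train_lstm` are hypothetical, the test window is assumed to end on 31/12/2018 23:00:00 (the last hourly timestamp of the tide data), and the training sets are assumed to stop right before the test window starts.

```python
import numpy as np
import pandas as pd

# Placeholder hourly series standing in for the preprocessed 2010-2018 data.
index = pd.date_range("2010-01-01 00:00:00", "2018-12-31 23:00:00", freq="h")
series = pd.Series(np.zeros(len(index)), index=index, name="level")

# Test window: last week of December 2018 (both endpoints included).
test_start = pd.Timestamp("2018-12-24 23:00:00")
test_end = pd.Timestamp("2018-12-31 23:00:00")
test = series.loc[test_start:test_end]

# Assumption: everything strictly before the test window is available for training.
before_test = series.loc[series.index < test_start]

# Linear models (ARIMA, UCM): last six months of 2018.
train_linear = before_test.loc["2018-07-01":]

# LSTM: the whole period from January 2010.
train_lstm = before_test.loc["2010-01-01":]

print(len(test), len(train_linear), len(train_lstm))
```

Keeping the test window out of both training sets avoids leaking the evaluation week into the fitted models.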
For the linear models, two strategies are implemented: the former integrates the meteorological variables with the lunar motion, while the latter extracts the principal periodic components using oce, an R package that helps oceanographers do their work by providing functions to read oceanographic data files.
Regarding the first strategy, after processing the meteorological data as previously mentioned, the lunar motion is obtained using PyEphem, an astronomy library that provides basic astronomical computations for the Python programming language. Given a date and a location on the Earth’s surface, it can compute the positions of the Sun and Moon, of the planets and their moons, and of any asteroids, comets, or Earth satellites whose orbital elements the user provides. To track the lunar motion, it is enough to select the period of interest and the coordinates of Venice.
The second strategy, as anticipated, concerns the principal periodic components that can be extracted from a sea level time series and used as regressors for the tide level time series. The oce package provides a function called tidem that fits a model in terms of sine and cosine components at the indicated tidal frequencies, with the amplitude and phase calculated from the resulting coefficients of the sine and cosine terms. tidem can extract up to 69 components, but we focused on 8 of them, in particular:
- M2, main lunar semi-diurnal with a period of ~12 hours;
- S2, main solar semi-diurnal (~12 hours);
- N2, lunar-elliptic semi-diurnal (~13 hours);
- K2, lunar-solar semi-diurnal (~12 hours);
- K1, lunar-solar diurnal (~24 hours);
- O1, main lunar diurnal (~26 hours);
- SA, solar annual (~24*365 hours);
- P1, main solar diurnal (~24 hours).
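The idea behind this kind of harmonic fit can be sketched with an ordinary least-squares regression on sine and cosine terms. This is not the oce implementation of tidem: it is a simplified NumPy illustration on synthetic data, using only four of the constituents listed above, with made-up amplitudes and phases, and with hypothetical helper names (`harmonic_design`, `fit_harmonics`).

```python
import numpy as np

# Periods in hours for a subset of the constituents listed above.
PERIODS_H = {"M2": 12.4206012, "S2": 12.0, "K1": 23.93447213, "O1": 25.81933871}


def harmonic_design(t_hours, periods=PERIODS_H):
    """Design matrix: intercept plus one cosine and one sine column per constituent."""
    cols = [np.ones_like(t_hours)]
    for period in periods.values():
        omega = 2.0 * np.pi / period
        cols.append(np.cos(omega * t_hours))
        cols.append(np.sin(omega * t_hours))
    return np.column_stack(cols)


def fit_harmonics(t_hours, level, periods=PERIODS_H):
    """Least-squares fit; returns the mean level and (amplitude, phase) per constituent."""
    X = harmonic_design(t_hours, periods)
    coef, *_ = np.linalg.lstsq(X, level, rcond=None)
    fitted = {}
    for i, name in enumerate(periods):
        a, b = coef[1 + 2 * i], coef[2 + 2 * i]
        # a*cos(wt) + b*sin(wt) == A*cos(wt - phi), with A = hypot(a, b), phi = atan2(b, a)
        fitted[name] = (np.hypot(a, b), np.arctan2(b, a))
    return coef[0], fitted


# Synthetic hourly signal over 60 days, with made-up amplitudes (cm) and phases (rad).
t = np.arange(0, 24 * 60, dtype=float)
true = {"M2": (23.0, 0.5), "S2": (14.0, -1.0), "K1": (16.0, 2.0), "O1": (5.0, 0.3)}
level = 30.0 + sum(
    amp * np.cos(2.0 * np.pi / PERIODS_H[name] * t - phase)
    for name, (amp, phase) in true.items()
)

mean_level, fitted = fit_harmonics(t, level)
for name, (amp, phase) in fitted.items():
    print(f"{name}: amplitude={amp:.2f} cm, phase={phase:.2f} rad")
```

The columns of the design matrix, evaluated at the timestamps of the series, are the kind of periodic regressors that can then be passed to the linear models. Separating constituents with very close periods, such as M2 and S2, requires a record longer than about two weeks, which is why the synthetic series spans 60 days.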